Getting started

In this section, we will see how to use PySCD to track historical changes in pandas DataFrame objects. For more background on slowly changing dimensions, see the Wikipedia article on the topic.

Importing packages

Before starting, you need to import the pandas and pyscd packages. You do that by executing:


In [1]:
import pandas as pd
import pyscd

Creating a dimension from scratch

Use one of the pandas IO tools to read your data (from CSV, HDF5, ...):


In [2]:
df = pd.read_csv('clients 2015-01.csv')
df


Out[2]:
           ssn  first_name  last_name                  email  state
0  230-06-2120     Barbara      Perez        bperez0@nih.gov     CA
1  889-07-1337        Fred    Kennedy  fkennedy1@nytimes.com     FL
2  221-33-1718      Pamela     Fields       pfields2@mlb.com     MA
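
To follow along without the file 'clients 2015-01.csv', the same table can be built from an inline string. This is only a convenience sketch and not part of PySCD; it assumes the file is comma-separated with a header row, which is not stated above.

```python
import io

import pandas as pd

# Inline stand-in for 'clients 2015-01.csv' (assumed to be comma-separated).
CLIENTS_2015_01 = """\
ssn,first_name,last_name,email,state
230-06-2120,Barbara,Perez,bperez0@nih.gov,CA
889-07-1337,Fred,Kennedy,fkennedy1@nytimes.com,FL
221-33-1718,Pamela,Fields,pfields2@mlb.com,MA
"""

# dtype=str keeps every column, including ssn, as text.
df = pd.read_csv(io.StringIO(CLIENTS_2015_01), dtype=str)
print(df)
```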

Now, let's create a dimension from this DataFrame:


In [3]:
dim = pyscd.SlowlyChangingDimension(
    df, source_keys='ssn', as_of='2015-01-01')
dim.df


Out[3]:
           ssn  first_name  last_name                  email  state  scd_id  scd_valid_from  scd_valid_to
0  230-06-2120     Barbara      Perez        bperez0@nih.gov     CA       0      2015-01-01    2199-12-31
1  889-07-1337        Fred    Kennedy  fkennedy1@nytimes.com     FL       1      2015-01-01    2199-12-31
2  221-33-1718      Pamela     Fields       pfields2@mlb.com     MA       2      2015-01-01    2199-12-31

I used the as_of parameter here to indicate that this data is from January. When as_of is omitted, the dimension uses the current date.
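
To see what the three scd_ columns in Out[3] hold, they can be reproduced with plain pandas. The sketch below only imitates the output shown above and is not PySCD's implementation; the helper name init_dimension and the OPEN_END constant are hypothetical, and the dates are kept as ISO strings for simplicity.

```python
import pandas as pd

# Far-future date that marks a row as still current (value taken from Out[3]).
OPEN_END = '2199-12-31'


def init_dimension(df, as_of=None):
    """Return a copy of df with the three scd_ columns seen in Out[3]."""
    if as_of is None:
        # Mirror the described default: fall back to the current date.
        as_of = pd.Timestamp.today().strftime('%Y-%m-%d')
    dim_df = df.reset_index(drop=True).copy()
    dim_df['scd_id'] = range(len(dim_df))  # surrogate key, one per row version
    dim_df['scd_valid_from'] = as_of       # date this version became valid
    dim_df['scd_valid_to'] = OPEN_END      # open-ended until superseded
    return dim_df


# Trimmed to two columns to keep the illustration short.
clients = pd.DataFrame({
    'ssn': ['230-06-2120', '889-07-1337', '221-33-1718'],
    'state': ['CA', 'FL', 'MA'],
})
print(init_dimension(clients, '2015-01-01'))
```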

Updating dimension with new data

Now let's import a new clients table:


In [4]:
df = pd.read_csv('clients 2015-02.csv')
df


Out[4]:
           ssn  first_name  last_name                  email  state
0  230-06-2120     Barbara      Perez        bperez0@nih.gov     CA
1  889-07-1337        Fred    Kennedy  fkennedy1@nytimes.com     NY
2  221-33-1718      Pamela     Fields       pfields2@mlb.com     MA

Note that Fred has moved from FL to NY.
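
With more than a handful of rows, a change like this is easier to find programmatically. The following sketch compares two snapshots on the ssn key with a plain pandas merge. It is independent of PySCD, and the frames are trimmed to the columns needed for the illustration.

```python
import pandas as pd

january = pd.DataFrame({
    'ssn': ['230-06-2120', '889-07-1337', '221-33-1718'],
    'state': ['CA', 'FL', 'MA'],
})
february = pd.DataFrame({
    'ssn': ['230-06-2120', '889-07-1337', '221-33-1718'],
    'state': ['CA', 'NY', 'MA'],
})

# Align both snapshots on the key; the suffixes tell old and new values apart.
both = january.merge(february, on='ssn', suffixes=('_old', '_new'))

# Keep only the keys whose tracked attribute differs between snapshots.
changed = both[both['state_old'] != both['state_new']]
print(changed)
```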

Before updating the dimension with this new data, let's indicate that this time the data is from February.


In [5]:
dim.as_of = '2015-02-01'

In [6]:
dim.update(df)

In [7]:
dim.df


Out[7]:
           ssn  first_name  last_name                  email  state  scd_id  scd_valid_from  scd_valid_to
0  230-06-2120     Barbara      Perez        bperez0@nih.gov     CA       0      2015-01-01    2199-12-31
1  889-07-1337        Fred    Kennedy  fkennedy1@nytimes.com     FL       1      2015-01-01    2015-02-01
2  221-33-1718      Pamela     Fields       pfields2@mlb.com     MA       2      2015-01-01    2199-12-31
3  889-07-1337        Fred    Kennedy  fkennedy1@nytimes.com     NY       3      2015-02-01    2199-12-31
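
Once history is recorded this way, the validity columns can answer two common questions: which rows are current, and what a client looked like on a given date. The sketch below retypes the relevant columns of Out[7] by hand and filters them with plain pandas. It assumes that scd_valid_from is inclusive and scd_valid_to is exclusive, which Out[7] suggests (Fred's FL row ends on the day his NY row starts) but does not confirm. The helper as_of_view is hypothetical and not a PySCD function.

```python
import pandas as pd

# Relevant columns of Out[7], typed in by hand.
dim_df = pd.DataFrame({
    'ssn': ['230-06-2120', '889-07-1337', '221-33-1718', '889-07-1337'],
    'state': ['CA', 'FL', 'MA', 'NY'],
    'scd_id': [0, 1, 2, 3],
    'scd_valid_from': ['2015-01-01', '2015-01-01', '2015-01-01', '2015-02-01'],
    'scd_valid_to': ['2199-12-31', '2015-02-01', '2199-12-31', '2199-12-31'],
})


def as_of_view(dim_df, date):
    """Rows valid on `date`, assuming a half-open [valid_from, valid_to) interval."""
    # ISO 'YYYY-MM-DD' strings sort chronologically, so string comparison works.
    mask = (dim_df['scd_valid_from'] <= date) & (dim_df['scd_valid_to'] > date)
    return dim_df[mask]


# Current rows are the ones still carrying the far-future end date.
current = dim_df[dim_df['scd_valid_to'] == '2199-12-31']
mid_january = as_of_view(dim_df, '2015-01-15')
print(current[['ssn', 'state']])
print(mid_january[['ssn', 'state']])
```

Under the same assumption, as_of_view(dim_df, '2015-02-01') returns Fred's NY row rather than the FL one, because the FL row's end date is treated as exclusive.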